revised version of mini-project 02 goes here
getwd()
## [1] "C:/Users/Tobias/Documents/Courses FloridaPoly/Semester 3/Data Visualisation and Reproducible Research/Projects/dataviz_final_project(1)/dataviz_final_project-main/project-02"
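# Packages used below (assumed to be loaded in a setup chunk that is not shown in this output)
library(tidyverse)
library(sf)
library(here)
library(plotly)
# The datasets were downloaded manually and placed in the data folder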
# Florida Lakes for Spatial Visualization
lakes <- st_read(here("data", "Florida_Lakes", "Florida_Lakes.shp"))
## Reading layer `Florida_Lakes' from data source
## `C:\Users\Tobias\Documents\Courses FloridaPoly\Semester 3\Data Visualisation and Reproducible Research\Projects\dataviz_final_project(1)\dataviz_final_project-main\data\Florida_Lakes\Florida_Lakes.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 4243 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -87.42774 ymin: 25.02625 xmax: -80.03097 ymax: 31.00254
## Geodetic CRS: WGS 84
# FIFA 18 player stats for interactive Visualization
fifa <- read_csv(here("data", "fifa18.csv"))
## Rows: 17076 Columns: 40
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, nationality, club
## dbl (37): age, overall, potential, acceleration, aggression, agility, balanc...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(fifa)
## # A tibble: 6 × 40
## name nationality club age overall potential acceleration aggression
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cristiano R… Portugal Real… 32 94 94 89 63
## 2 L. Messi Argentina FC B… 30 93 93 92 48
## 3 Neymar Brazil Pari… 25 92 94 94 56
## 4 L. Suárez Uruguay FC B… 30 92 92 88 78
## 5 M. Neuer Germany FC B… 31 92 92 58 29
## 6 R. Lewandow… Poland FC B… 28 91 91 79 80
## # ℹ 32 more variables: agility <dbl>, balance <dbl>, ball_control <dbl>,
## # composure <dbl>, crossing <dbl>, curve <dbl>, dribbling <dbl>,
## # finishing <dbl>, free_kick_accuracy <dbl>, gk_diving <dbl>,
## # gk_handling <dbl>, gk_kicking <dbl>, gk_positioning <dbl>,
## # gk_reflexes <dbl>, heading_accuracy <dbl>, interceptions <dbl>,
## # jumping <dbl>, long_passing <dbl>, long_shots <dbl>, marking <dbl>,
## # penalties <dbl>, positioning <dbl>, reactions <dbl>, short_passing <dbl>, …
#Houses in West Roxbury
houses <- read_csv(here("data", "WestRoxbury.csv"))
## Rows: 5802 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): REMODEL
## dbl (13): TOTAL VALUE, TAX, LOT SQFT, YR BUILT, GROSS AREA, LIVING AREA, FLO...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Since I’m a big football enthusiast and have played the FIFA video game series for many years, I know that the “pace” stat is crucial for a player to be considered usable or even “meta”, which means top-tier, in the game. For better understanding: in the online mode of the FIFA games (called FIFA Ultimate Team), each user builds their own squad with players they can link together based on chemistry (League, Club, Nationality). Each player has different stats. With this visualization, I want to explore how speed compares among the highest-rated players and gain interactive insights into how many of them are potentially impactful in gameplay.
My first idea is to visualize the largest lakes in Florida, focusing on Polk County. I wanted to see which lakes have the largest surface area, especially since I recently went on a trail run along Lake Hancock. That made me curious how it compares to others in the county and sparked my interest in exploring the lake topography.
For the final visualization, I wanted to explore how certain home characteristics might influence housing prices. One question worth investigating is how the size of a home affects its value.
# Prepare the dataset: the 30 best-rated players with the relevant pace attributes and player information
fifa_top30 <- fifa %>%
  arrange(desc(overall)) %>%
  slice(1:30) %>%
  select(name, nationality, club, age, overall, sprint_speed, acceleration, stamina)
fifa_top30
## # A tibble: 30 × 8
## name nationality club age overall sprint_speed acceleration stamina
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cristiano … Portugal Real… 32 94 91 89 92
## 2 L. Messi Argentina FC B… 30 93 87 92 73
## 3 Neymar Brazil Pari… 25 92 90 94 78
## 4 L. Suárez Uruguay FC B… 30 92 77 88 89
## 5 M. Neuer Germany FC B… 31 92 61 58 44
## 6 R. Lewando… Poland FC B… 28 91 83 79 79
## 7 De Gea Spain Manc… 26 90 58 57 40
## 8 E. Hazard Belgium Chel… 26 90 87 93 79
## 9 T. Kroos Germany Real… 27 90 52 60 77
## 10 G. Higuaín Argentina Juve… 29 90 80 78 72
## # ℹ 20 more rows
p <- ggplot(fifa_top30, aes(x = sprint_speed, y = overall,
                            color = nationality,
                            text = paste("Name:", name,
                                         "<br>Club:", club,
                                         "<br>Nation:", nationality,
                                         "<br>Sprint Speed:", sprint_speed,
                                         "<br>Acceleration:", acceleration,
                                         "<br>Stamina:", stamina,
                                         "<br>Overall:", overall))) +
  geom_point(size = 2, alpha = 0.75) +
  labs(title = "Top 30 FIFA 18 Players: Sprint Speed vs Overall Rating",
       subtitle = "Colored by nationality | Hover for detailed attributes",
       caption = "Data: EA Sports",
       x = "Sprint Speed",
       y = "Overall Rating",
       color = "Nationality") +
  theme_minimal()
ggplotly(p, tooltip = "text")
# Prepare: reproject from the geographic CRS (units in degrees) to a projected CRS with metric units
# (EPSG:5070, reported below as NAD83 / Conus Albers)
lakes_m <- st_transform(lakes, crs = 5070)
# Compute the lake areas in m²
lakes_m$area_m2 <- st_area(lakes_m)
# Filter for Polk County and select the 10 largest lakes
lakes_polk_top10 <- lakes_m %>%
  filter(COUNTY == "POLK") %>%
  slice_max(SHAPEAREA, n = 10)
lakes_polk_top10
## Simple feature collection with 10 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 1380359 ymin: 621996.5 xmax: 1444512 ymax: 667924.2
## Projected CRS: NAD83 / Conus Albers
## PERIMETER NAME COUNTY OBJECTID SHAPEAREA SHAPELEN area_m2
## 1 21529.93 Lake Weohyakapka POLK 2700 30549713 21529.93 30549713 [m^2]
## 2 19950.41 Lake Hancock POLK 2703 18549186 19950.40 18549186 [m^2]
## 3 18557.39 Lake Rosalie POLK 2701 18358611 18557.39 18358611 [m^2]
## 4 34664.94 Crooked Lake POLK 2690 16885793 34664.94 16885793 [m^2]
## 5 31582.90 Lake Pierce POLK 1937 15604508 31582.90 15604508 [m^2]
## 6 20955.50 Lake Arbuckle POLK 3907 15412988 20955.50 15412988 [m^2]
## 7 16652.12 Reedy Lake POLK 76 14191962 16652.12 14191962 [m^2]
## 8 21007.95 Lake Marion POLK 1936 12226186 21007.95 12226186 [m^2]
## 9 16855.88 Lake Hamilton POLK 67 8668138 16855.88 8668138 [m^2]
## 10 21815.34 Lake Parker POLK 1608 8533253 21815.34 8533253 [m^2]
## geometry
## 1 MULTIPOLYGON (((1437081 641...
## 2 MULTIPOLYGON (((1393835 646...
## 3 MULTIPOLYGON (((1436798 653...
## 4 MULTIPOLYGON (((1422458 631...
## 5 MULTIPOLYGON (((1426426 652...
## 6 MULTIPOLYGON (((1439228 627...
## 7 MULTIPOLYGON (((1427767 629...
## 8 MULTIPOLYGON (((1420520 667...
## 9 MULTIPOLYGON (((1409616 662...
## 10 MULTIPOLYGON (((1382467 658...
ggplot(lakes_polk_top10) +
  geom_sf(aes(fill = as.numeric(area_m2)), color = "black", alpha = 0.9, size = 0.2) +
  scale_fill_viridis_c(option = "inferno",
                       name = "Lake Area in m²",
                       labels = scales::label_number(scale = 1e-6, suffix = "M"),
                       breaks = scales::breaks_pretty(n = 5)) +
  labs(title = "Top 10 Largest Lakes in Polk County, Florida",
       subtitle = "Based on surface area",
       caption = "Data: Florida Department of Environmental Protection") +
  theme_minimal() +
  theme(axis.title = element_blank(),
        plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 10),
        legend.position = "bottom") +
  geom_sf_label(aes(label = NAME), size = 2.5, color = "black") +
  coord_sf(crs = "+proj=aea +lat_1=24 +lat_2=31 +lat_0=23 +lon_0=-84")
colnames(houses)
## [1] "TOTAL VALUE" "TAX" "LOT SQFT" "YR BUILT" "GROSS AREA"
## [6] "LIVING AREA" "FLOORS" "ROOMS" "BEDROOMS" "FULL BATH"
## [11] "HALF BATH" "KITCHEN" "FIREPLACE" "REMODEL"
head(houses)
## # A tibble: 6 × 14
## `TOTAL VALUE` TAX `LOT SQFT` `YR BUILT` `GROSS AREA` `LIVING AREA` FLOORS
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 344. 4330 9965 1880 2436 1352 2
## 2 413. 5190 6590 1945 3108 1976 2
## 3 330. 4152 7500 1890 2294 1371 2
## 4 499. 6272 13773 1957 5032 2608 1
## 5 332. 4170 5000 1910 2370 1438 2
## 6 337. 4244 5142 1950 2124 1060 1
## # ℹ 7 more variables: ROOMS <dbl>, BEDROOMS <dbl>, `FULL BATH` <dbl>,
## # `HALF BATH` <dbl>, KITCHEN <dbl>, FIREPLACE <dbl>, REMODEL <chr>
#no real dataset preparation needed
ggplot(houses, aes(x = `LIVING AREA`, y = `TOTAL VALUE`)) +
  geom_point(alpha = 0.3, color = "darkblue", size = 1) +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(title = "Effect of Living Area on Housing Price in West Roxbury",
       subtitle = "Method used: Linear regression",
       caption = "Data: West Roxbury Housing Dataset",
       x = "Living Area in sqft",
       y = "Total Value in 1,000s USD") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The following section addresses the questions for the report, with a subheader for each visualization.
This mini-project 2 explores three real-world datasets through three types of visualization: interactive, spatial, and model-based. The goal is to gain insights into diverse categories: video games, geography, and real estate. Each visualization was chosen based on personal interest and its potential to reveal meaningful insights.
Side Note Regarding FIFA 18: For better understanding, in the online mode of the FIFA games (called FIFA Ultimate Team), users build their own squads by linking players together based on chemistry (League, Club, Nationality). Pace, consisting of sprint speed and acceleration, is one of the most valued attributes, as it significantly affects in-game performance and usability.
1. Interactive Visualization: A scatter plot of FIFA 18 player stats, focusing on overall rating and sprint speed. The goal was to explore how pacey the top players of FIFA 18 are and to give users insight into how well players might fit the online squad that they are building. The chart is interactive: hovering over a point shows general information about the player, such as name, nationality, and club (which are relevant for chemistry), as well as the pace-related stats acceleration and stamina. Points are color-coded by nationality. The FIFA 18 dataset contains a large number of players (17,076) and variables (40). To focus the analysis, the top 30 players were selected based on their overall rating, and the attributes related to pace, specifically sprint speed, acceleration, and stamina, were kept alongside basic player information. The tooltip text was assembled from these columns so that the interactive chart displays nationality and player-specific information.
2. Spatial Visualization: A map of the 10 largest lakes by area in Polk County, Florida. The map was designed to explore their geographic distribution and relative sizes. Each lake is color-filled according to its surface area and labeled with its name to improve orientation and recognition. The goal was to understand how these major lakes are distributed across the county and which ones have the largest extent. The original lake data used a geographic coordinate system, which measures in degrees and is therefore not suitable for computing easily interpretable areas. To fix this, the data was transformed into a projected coordinate system (EPSG:5070, NAD83 / Conus Albers), and the lake areas were calculated in square meters using st_area(). The dataset was then filtered for lakes in Polk County, and the 10 largest by area were selected for the final map.
For the model-based visualization using the West Roxbury housing dataset, only minimal data preparation was necessary, mainly making sure that the column names were referenced correctly. The dataset included living area (in square feet) and total home value (in thousands of U.S. dollars), making it suitable for regression analysis. Column names containing spaces were accessed using backticks so that R parses them as single names. The data was then used to explore the relationship between home size and property value through a scatter plot with a fitted regression line.
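The backtick access mentioned above can be sketched in a few lines of base R. This is a minimal illustration, not the report's own code: `houses_demo` and `houses_clean` are hypothetical names, and the three rows are copied from the `head(houses)` printout, where TOTAL VALUE appears rounded.

```r
# First three rows as displayed by head(houses); TOTAL VALUE is rounded in that printout
houses_demo <- data.frame(
  `TOTAL VALUE` = c(344, 413, 330),
  `LIVING AREA` = c(1352, 1976, 1371),
  check.names = FALSE  # keep the spaces instead of converting them to dots
)

# Three equivalent ways to reach a column whose name contains a space
total_a <- houses_demo$`TOTAL VALUE`
total_b <- houses_demo[["TOTAL VALUE"]]
total_c <- with(houses_demo, `TOTAL VALUE`)

# Optional alternative: switch to syntactic names so backticks are no longer needed
houses_clean <- houses_demo
names(houses_clean) <- tolower(gsub(" ", "_", names(houses_clean)))
names(houses_clean)
```

Renaming is a design choice rather than a requirement; keeping the original names, as the report does, preserves a direct link to the source file.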
3. Model-Based Visualization: A regression plot showing the relationship between living area and housing prices in West Roxbury, using a linear trend line. The goal was to understand how home size influences value and whether a clear linear association can be observed.
1. Spatial Visualization: The map reveals that the largest lakes in Polk County are scattered across different areas, with a noticeable concentration in the eastern part of the county. The largest among them is Lake Weohyakapka. Including the lake names added helpful geographic context but also posed visual challenges: some labels overlapped or obscured the lakes beneath them. This required several iterations on label size, positioning, and map theme to find a readable balance. Another challenge arose in the data preparation: the original coordinate system used degrees as units, which are not suitable for expressing surface areas (the result would be in degrees²). To ensure accurate area values in square meters, I had to transform the data to a projected coordinate reference system, and finding a suitable one was complicated. While the current map provides a clear overview, a future improvement could be to add a location map of Florida to better situate Polk County within the state, and to add landmarks or cities to further aid orientation.
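To illustrate why degrees are a poor unit for area, the sketch below estimates the ground length of one degree of longitude at the southern and northern edges of the lake dataset's bounding box. It assumes a spherical Earth with a mean radius of 6371 km, which is a simplification, and the function name `km_per_degree_lon` is made up for this example.

```r
# Assumption: spherical Earth with mean radius 6371 km (a simplification)
earth_radius_km <- 6371

# Ground length of one degree of longitude at a given latitude
km_per_degree_lon <- function(lat_deg) {
  (pi / 180) * earth_radius_km * cos(lat_deg * pi / 180)
}

# Latitudes taken from the bounding box reported by st_read()
lat_south <- 25.02625
lat_north <- 31.00254

south_km <- km_per_degree_lon(lat_south)
north_km <- km_per_degree_lon(lat_north)

round(c(south = south_km, north = north_km), 1)
```

Under these assumptions one degree of longitude spans roughly 101 km at the southern edge and roughly 95 km at the northern edge, so a "square degree" does not correspond to a fixed ground area. A projected CRS with metric units avoids this problem.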
2. Interactive Visualization: This plot helps identify whether top-rated players in FIFA 18 also possess strong sprint speed, a stat that is crucial for in-game performance and often a deciding factor when users select players for a team. The interactive tooltips allow further exploration of individual attributes, such as the pace-related acceleration and stamina, and of general team-building information such as nation, name, and club. The goal was to find players who are both highly rated overall and exceptionally fast, so that they can be used in the user's team. The visualization reveals that some of the highest-rated players, such as Mats Hummels or Toni Kroos, may not be ideal for gameplay due to their lack of pace. Conversely, players like Aubameyang, who has a lower overall rating (88) but exceptional sprint speed (96), are often preferred over slower but higher-rated players like Luis Suárez (92 overall, 77 sprint speed), because they make a greater impact in fast-paced gameplay. This view offers an engaging way to evaluate which players are actually viable in-game, beyond their overall rating. A valuable future extension could involve comparing sprint speed with position-specific attributes: passing for midfielders, defending for defenders, and shooting for attackers. This would help identify well-rounded players and position-specific viable options, and offer more useful insights into gameplay decisions. A key challenge was finding the balance between clarity and detail. Too many players could have overwhelmed the viewer, so finding the right number of players to display, with only the relevant information, required some careful testing.
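The idea of screening for players who are both highly rated and fast can be sketched in base R. The rows below are copied from the `fifa_top30` printout (only players whose names are not truncated there). The pace score as a plain average of sprint speed and acceleration and the threshold of 85 are assumptions made for this illustration; they are not the game's own formula.

```r
# Rows copied from the fifa_top30 printout (players with untruncated names only)
players <- data.frame(
  name         = c("L. Messi", "Neymar", "L. Suárez", "M. Neuer",
                   "De Gea", "E. Hazard", "T. Kroos", "G. Higuaín"),
  overall      = c(93, 92, 92, 92, 90, 90, 90, 90),
  sprint_speed = c(87, 90, 77, 61, 58, 87, 52, 80),
  acceleration = c(92, 94, 88, 58, 57, 93, 60, 78)
)

# Assumption: pace as the simple mean of sprint speed and acceleration
players$pace <- (players$sprint_speed + players$acceleration) / 2

# Hypothetical cut-off for calling a player "fast"
pace_threshold <- 85

# Keep the fast players and sort them from fastest to slowest
fast <- players[players$pace >= pace_threshold, ]
fast <- fast[order(-fast$pace), c("name", "overall", "pace")]
fast
```

With these inputs the filter keeps Neymar, E. Hazard, and L. Messi, in that order. A different threshold or weighting would change the selection, so the cut-off should be treated as a tuning choice.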
3. Model-Based Visualization: This scatter plot with a regression line shows a positive relationship between living area and housing price, confirming the expected trend: larger homes tend to cost more. However, the spread of the data suggests other influencing factors beyond size, such as location, condition, or year built. One limitation was the lack of clearly labeled units in the dataset. I assumed square feet for living area and thousands of USD for price, which seems reasonable based on the value ranges. Further refinement could include a multiple regression analysis with more factors, or a segmentation of the market by year built.
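The trend line drawn by `geom_smooth(method = "lm")` corresponds to an ordinary least-squares fit, which can also be inspected directly with `lm()`. The sketch below uses only the six rows shown by `head(houses)`, with TOTAL VALUE rounded as displayed, so its numbers are purely illustrative and are not the result for the full dataset. `houses_demo` is a hypothetical name.

```r
# Six rows as displayed by head(houses); far too few for a real analysis
houses_demo <- data.frame(
  `TOTAL VALUE` = c(344, 413, 330, 499, 332, 337),
  `LIVING AREA` = c(1352, 1976, 1371, 2608, 1438, 1060),
  check.names = FALSE  # keep the original column names with spaces
)

# Same model form that geom_smooth(method = "lm") fits: y ~ x
fit <- lm(`TOTAL VALUE` ~ `LIVING AREA`, data = houses_demo)

# Slope: change in total value (thousands of USD, assumed) per additional square foot
slope <- unname(coef(fit)[2])

# R squared: share of the variance in total value explained by living area
r_squared <- summary(fit)$r.squared

c(slope = slope, r_squared = r_squared)
```

Running the same call on the full `houses` data would quantify the spread mentioned above, and adding further predictors to the formula would be one way to approach the multiple regression suggested as a refinement.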
Throughout the project, I focused on clarity, readability, and visual storytelling. For the spatial visualization of lakes, I used an appropriate color gradient to represent surface area, added lake labels for geographic context and information, and chose a projection that preserved relative area sizes. I adjusted text placement and border color to maintain contrast and neatness. In the interactive FIFA scatter plot, I mapped sprint speed and overall rating to the axes to reveal meaningful patterns and relevant insights for users. Color encoding by nationality added another layer of insight, while tooltips provided further context without overloading the chart. For the regression plot of housing prices, I emphasized the relationship between variables using a clean scatter plot with a fitted trend line and a minimal theme. Axis labels and units were clearly defined to ensure interpretability. Across all visualizations, I applied consistency in color usage, labeling, and layout, and ensured each plot communicated its main message effectively without unnecessary complexity.
This mini-project showed how different types of visualizations (spatial, interactive, and model-based) can each reveal unique insights from diverse datasets. By applying key design principles and incorporating personal interest, the visualizations became engaging and meaningful. The process also highlighted the importance of thoughtful variable selection, clarity, and context in effective data storytelling.